Mining Statistically Significant Substrings using the Chi-Square Statistic

Mayank Sachan Arnab Bhattacharya 

mayasac@cse.iitk.ac.in arnabb@iitk.ac.in 

Dept. of Computer Science and Engineering 
Indian Institute of Technology, Kanpur 
INDIA 



ABSTRACT 

The problem of identification of statistically significant patterns in 
a sequence of data has been applied to many domains such as intru- 
sion detection systems, financial models, web-click records, auto- 
mated monitoring systems, computational biology, cryptology, and 
text analysis. An observed pattern of events is deemed to be statis- 
tically significant if it is unlikely to have occurred due to random- 
ness or chance alone. We use the chi-square statistic as a quantita- 
tive measure of statistical significance. Given a string of characters 
generated from a memoryless Bernoulli model, the problem is to 
identify the substring for which the empirical distribution of sin- 
gle letters deviates the most from the distribution expected from 
the generative Bernoulli model. This deviation is captured using 
the chi-square measure. The most significant substring (MSS) of a 
string is thus defined as the substring having the highest chi-square 
value. To date, to the best of our knowledge, no algorithm finds
the MSS in better than $O(n^2)$ time, where $n$
denotes the length of the string. In this paper, we propose an
algorithm to find the most significant substring, whose running time
is $O(n^{3/2})$ with high probability. We also study some variants of
this problem such as finding the top-t set, finding all substrings hav- 
ing chi-square greater than a fixed threshold and finding the MSS 
among substrings greater than a given length. We experimentally 
demonstrate the asymptotic behavior of the MSS on varying the 
string size and alphabet size. We also describe some applications 
of our algorithm on cryptology and real world data from finance 
and sports. Finally, we compare our technique with the existing 
heuristics for finding the MSS. 

1. MOTIVATION 

Statistical significance is used to ascertain whether the outcome 
of a given experiment can be ascribed to some extraneous factors 
or is solely due to chance. Given a string composed of characters 
from an alphabet $\Sigma = \{a_1, a_2, \ldots, a_k\}$ of constant size $k$, the null
hypothesis assumes that the letters of the string are generated from
a memoryless Bernoulli model. Each letter of the string is drawn
randomly and independently from a fixed multinomial probability
distribution $P = \{p_1, p_2, \ldots, p_k\}$ where $p_i$ denotes the probability
of occurrence of character $a_i$ in the alphabet ($\sum_{i=1}^{k} p_i = 1$). The
objective is to find the connected subregion of the string (i.e., a
substring) for which the empirical distribution of single letters deviates
the most from the distribution given by the Bernoulli model. 

Detection of statistically relevant patterns in a sequence of events 
has drawn significant interest in the computer science community 
and has been diversely applied in many fields including molecular 
biology, cryptology, telecommunications, intrusion detection,
automated monitoring, text mining, and financial modeling. The
applications in computational biology include assessing the
over-representation of exceptional patterns [7] and studying the mutation
characteristics in the protein sequence of an organism by identify- 
ing the sudden changes in their mutation rates [18]. Different stud- 
ies suggest detecting intrusions in various information systems by 
searching for hidden patterns that are unlikely to occur [26, 27]. In 
telecommunication, it has been applied to detect periods of heavy 
traffic [13]. It has also been used in analyzing financial time series 
to reveal hidden temporal patterns that are characteristic and pre- 
dictive of time series events [22] and to predict stock prices [17]. 

Quantifying a substring as statistically significant depends on the 
statistical model used to calculate the deviation of the empirical 
distribution of single letters from its expected nature. The exact 
formulation of statistical significance depends on the metric used; 
p-value and z-score [23, 25] represent the two most commonly used 
ones (some of the other ones are reviewed in [10, 24]). Research 
indicates that in most practical cases, p-value provides more precise 
and accurate results as compared to z-score [7]. 

The p-value is defined as the probability of obtaining a test statis- 
tic at least as extreme as the one that was actually observed assum- 
ing the null hypothesis to be true. For example, in an experiment 
to determine whether a coin is fair, suppose it turns up head on 19 
out of 20 tosses. Assuming the null hypothesis, i.e., the coin is fair, 
to be true, the p-value is equal to the probability of observing 19 or 
more heads in 20 flips of a fair coin:$^1$

$$\text{p-value} = \Pr(19H) + \Pr(20H) = \frac{\binom{20}{19} + \binom{20}{20}}{2^{20}} \approx 0.002\%$$

Traditionally, the decision to reject or fail to reject the null
hypothesis is based on a pre-defined significance level $\alpha$. If the p-value is
low, the result is less likely assuming the null hypothesis to be true. 
Consequently, the observation is statistically more significant. 
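
As a small illustration of the coin example above, the exact one-sided p-value can be computed by summing binomial tail probabilities. This is only a sketch; the function name `binomial_tail` is an assumption introduced here and does not come from the paper.

```python
from math import comb

def binomial_tail(n, k_min, p=0.5):
    """Probability of observing at least k_min successes in n Bernoulli(p) trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# One-sided p-value for 19 or more heads in 20 tosses of a fair coin.
p_value = binomial_tail(20, 19)
print(f"{p_value:.6%}")
```

The sum has only two terms here, but the same computation over a multinomial count vector grows quickly with the alphabet size and the string length, which is why asymptotic approximations are used later in the section.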



$^1$ This definition of p-value is part of a one-sided test; however, we
can also calculate the probability of getting at least 19 heads or at 
least 19 tails which is part of a two-sided test. The p-value is just 
double in this case due to symmetry.

In a memoryless Bernoulli multinomial model, the probability
of observing a configuration $\beta_0$, given by a count vector $C = \{Y_1, Y_2, \ldots, Y_k\}$
with $\sum_{i=1}^{k} Y_i = l$ (where $l$ is the length of the
substring) denoting the set of observed frequencies of each character
in the alphabet, is defined as

$$\Pr(C = \beta_0) = l! \prod_{i=1}^{k} \frac{p_i^{Y_i}}{Y_i!} \qquad (1)$$

The p-value for this model then is

$$\text{p-value} = \sum_{\beta \text{ more extreme than } \beta_0} \Pr(\beta) \qquad (2)$$

However, computing the p-value exactly requires analyzing all
possible outcomes of the experiment which are potentially exponential
in number, thereby rendering the computation impractical. More- 
over, it has been shown that for large samples, asymptotic approxi- 
mations are accurate enough and easier to calculate [24]. 

The two broadly used approximations are the likelihood ratio
statistic and Pearson's chi-square statistic [24]. In the case of the
likelihood ratio test, an alternative hypothesis is set up under which
each $p_i$ is replaced by its maximum likelihood estimate $\pi_i = X_i/n$,
with the exact probability of a configuration under the null hypothesis
defined similarly as in the previous case. The natural logarithm of
the ratio between these two probabilities multiplied by $-2$ is then
the statistic for the likelihood ratio test:

$$-2\ln(LR) = 2\sum_{i=1}^{k} X_i \ln\left(\frac{\pi_i}{p_i}\right) \qquad (3)$$



Alternatively, Pearson's chi-square statistic, denoted by $X^2$,
measures the deviation of the observed frequency distribution from the
theoretical distribution [5]:

$$X^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} = \sum_{i=1}^{k} \frac{(Y_i - l p_i)^2}{l p_i} \qquad (4)$$



where $O_i$ and $E_i$ are the observed and theoretical frequencies of the
characters in the substring. Since each letter of the substring is
drawn from a fixed probability distribution, the expected frequency
$E_i$ of a character in the substring is obtained by multiplying the
length of the substring $l$ with the probability of occurrence of that
character. Hence, the expected frequency vector is given by $E = lP$,
where $P = \{p_1, p_2, \ldots, p_k\}$. The chi-square ($X^2$) definition
in (4) can be further simplified as:



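$$X^2 = \sum_{i=1}^{k} \frac{(Y_i - l p_i)^2}{l p_i} = \sum_{i=1}^{k} \frac{Y_i^2}{l p_i} - 2\sum_{i=1}^{k} Y_i + l\sum_{i=1}^{k} p_i = \sum_{i=1}^{k} \frac{Y_i^2}{l p_i} - l \qquad \left(\because \sum_{i=1}^{k} Y_i = l \text{ and } \sum_{i=1}^{k} p_i = 1\right) \qquad (5)$$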



Note that the chi-square value for a substring depends only on the
count of the characters in it, and not on the order in which they
appear. It can be seen in the coin toss example that all the outcomes
that are less likely to occur have higher $X^2$ values than the observed
outcome. For multinomial models, under the null hypothesis, both
the $X^2$ statistic and the $-2\ln(LR)$ statistic converge to the $\chi^2$ distribution
with $k - 1$ degrees of freedom [21, 24]. Hence, the p-value of the
outcome can then be computed using the cumulative distribution
function (cdf) $F(x)$ of the $\chi^2(k-1)$ distribution. If $z_0$ is the $X^2$
value of the observed outcome, then its p-value is $1 - F(z_0)$.
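
The simplified form of the statistic lends itself to a direct computation from a count vector. The following is a minimal sketch of that computation; the helper names `chi_square` and `chi_square_of_string` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def chi_square(counts, probs):
    """Pearson X^2 of a count vector, using sum(Y_i^2 / (l * p_i)) - l."""
    l = sum(counts)
    return sum(y * y / (l * p) for y, p in zip(counts, probs)) - l

def chi_square_of_string(s, alphabet, probs):
    """X^2 of a string; only character counts matter, not their order."""
    freq = Counter(s)
    return chi_square([freq[ch] for ch in alphabet], probs)

# Coin example: 19 heads and 1 tail under a fair coin.
print(chi_square([19, 1], [0.5, 0.5]))
# Two orderings of the same multiset of characters give the same value.
print(chi_square_of_string("HHT", "HT", [0.5, 0.5]),
      chi_square_of_string("THH", "HT", [0.5, 0.5]))
```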

Moreover, it has also been shown that the $X^2$ statistic converges
to the $\chi^2$ distribution from below as opposed to the $-2\ln(LR)$
statistic which converges from above [21, 24]. Thus, the chi-square
statistic diminishes the probability of type-I errors (false positives).
Considering these significant advantages, we adopt Pearson's
$X^2$ statistic as the estimate to quantify the statistical significance
in our study.

In this paper, we focus on the problem where only portions of the 
string instead of the whole string may deviate from the expected 
behavior. As discussed in the experimental section, this problem 
is particularly useful in the analysis of temporal strings where an 
external event occurring in the middle of a string may be causing 
the particular substring to deviate significantly from the expected 
behavior by inflating or deflating the probabilities of occurrence of 
some characters in the alphabet. Our work focuses on the problem 
of identification of such statistically significant substrings in large 
strings. Before venturing forward, we formally define the different 
problem statements handled in this paper for a string S of length n. 

Problem 1 (Most Significant Substring). Find the
most significant substring (MSS) of $S$, which is the substring having
the highest chi-square value ($X^2$) among all possible substrings.

Problem 2 (Top-t substrings). Find the top-t set $T$ of $t$
substrings such that $|T| = t$ and for any two arbitrary substrings
$S_1 \in T$ and $S_2 \notin T$, $X^2_{S_1} \ge X^2_{S_2}$.

Problem 3 (Significance greater than threshold).
Find all substrings having chi-square value ($X^2$) greater than a
given threshold $\alpha_0$.

Problem 4 (MSS greater than given length). Find
the substring having the highest chi-square value ($X^2$) among all
substrings of length greater than $\gamma_0$.

The rest of the paper is organized as follows. Section 2 provides 
an overview of the related work. Section 3 formulates some impor- 
tant definitions and observations used by our algorithm. Section 4 
describes the algorithm for finding the MSS of a string. Section 5 
presents the analysis of the algorithm. Section 6 extends the MSS 
finding algorithm to the more general problems. Section 7 shows 
the experimental analysis and some applications of the algorithm 
on real datasets. Finally, Section 8 discusses possible future work. 

2. RELATED WORK 

The problem of identifying frequent and statistically relevant 
subsequences (not necessarily contiguous) in a sequence has been 
an active area of research over the past decade [19]. The problem 
of finding statistically significant subsequences within a window 
of size w has also been addressed [3, 15]. Since the number of 
subsequences grows exponentially with w, the task of computing 
subsequences within a large window is practically infeasible. 

We address a different version of the problem where the window 
size can be arbitrarily large but statistically significant patterns are 
constrained to be contiguous, thus forming substrings of the given 
string. The problem has many relevant applications in places where 
the extraneous factor that triggers such unexpected patterns occurs
continuously over an arbitrarily large period in the course of a
sequence, as in the case of temporal strings. As the possible number
of substrings reduces to $O(n^2)$, the problem of computing statistically
significant patterns becomes much more scalable. However,
it is still computationally intensive for large data.

The trivial algorithm proceeds by checking all $O(n^2)$ possible
substrings. Some improvements such as a blocking technique and a
heap strategy were proposed, but they showed no asymptotic
improvement in the time complexity [2]. Two algorithms, namely,
ARLM and AGMM, were proposed which use local maxima to
find the MSS [9]. It was claimed (only through a conjecture and
not a proof) that ARLM would find the MSS. However, its time
complexity is still $O(n^2)$ with only constant-factor improvements.
AGMM was an $O(n)$ time heuristic that found a substring whose
$X^2$ value was roughly close to the $X^2$ value of the MSS, but no
theoretical guarantees were provided on the bound of the approximation
ratio. The comparative analysis of our algorithms with them
is shown in detail in Section 7. To the best of our knowledge, no
algorithm exists to date that exactly finds the MSS or solves the
other variants of the problem in better than $O(n^2)$ time.

It may seem that a fast algorithm can be obtained using the
suffix tree$^2$ [14]. However, the problem at hand is different. To
compute the $X^2$ value of any substring we need not traverse the whole
substring; rather, we just need the number of occurrences of each
character in that substring. This can be easily computed in $O(1)$
time by maintaining $k$ count arrays, one for each character of the
alphabet, where the $i$-th element of the array stores the number of
occurrences of the character up to the $i$-th position in the string. Each array
can be preprocessed in $O(n)$ time. Furthermore, due to the complex
non-linear nature of the $X^2$ function we assume that no obvious
properties of suffix trees or their invariants can be utilized.
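
The count arrays described above can be sketched as follows. This is an illustrative sketch only: the names `build_count_arrays` and `substring_counts` are assumptions, and positions are 1-indexed to match the paper's notation. Retrieving all $k$ counts costs $O(k)$, which is constant for a fixed alphabet.

```python
def build_count_arrays(s, alphabet):
    """One array per character; arrays[j][pos] counts alphabet[j] in s[1..pos]."""
    index = {ch: j for j, ch in enumerate(alphabet)}
    arrays = [[0] * (len(s) + 1) for _ in alphabet]
    for pos, ch in enumerate(s, start=1):
        for j in range(len(alphabet)):
            arrays[j][pos] = arrays[j][pos - 1]   # carry counts forward
        arrays[index[ch]][pos] += 1               # count the current character
    return arrays

def substring_counts(arrays, i, j):
    """Count vector of S[i..j] (1-indexed, inclusive) from two array lookups per character."""
    return [row[j] - row[i - 1] for row in arrays]

arrays = build_count_arrays("baacd", "abcd")
print(substring_counts(arrays, 2, 4))  # counts for the substring "aac"
```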

The trivial algorithm checks all possible substrings, which have
$O(n)$ starting positions and, for each starting position, $O(n)$
ending positions, thus requiring $O(n^2)$ time. Our algorithm also
considers all the $O(n)$ starting positions, but for a particular
starting position, it does not check all possible ending positions. Rather,
it skips ending positions that cannot generate candidates for the
MSS or the top-t set. We show that for a particular starting
position, we check only $O(\sqrt{n})$ different ending positions, thereby
scanning a total of only $O(n^{3/2})$ substrings. We formally show
that the running time of our algorithm is $O(n^{3/2})$. We also extend
the algorithm for finding the top-t substrings and other variants, all
of which, again, run in $O(n^{3/2})$ time.

3. DEFINITIONS AND OBSERVATIONS 

In the rest of the paper, any string $S$ over a multinomial alphabet
$\Sigma = \{a_1, a_2, \ldots, a_k\}$ and drawn from a fixed probability
distribution $P = \{p_1, p_2, \ldots, p_k\}$ is phrased as "$S$ over $(\Sigma, P)$". For a
given string $S$ of length $n$, $S[i]$ ($1 \le i \le n$) denotes the $i$-th letter
of the string $S$ and $S[i \ldots j]$ denotes the substring of $S$ from index
$i$ to index $j$, both included. So, the complete string $S$ can also be
denoted by $S[1 \ldots n]$.

Definition 1 (Chain cover). For any string $S$ of length
$l$, a string $\lambda(S, a_i, l_1)$ of length $l + l_1$ is said to be the chain cover
of $S$ over $l_1$ symbols of character $a_i$ if $S$ is the prefix of $\lambda(S, a_i, l_1)$
and the last $l_1$ positions of $\lambda(S, a_i, l_1)$ are occupied by the
character $a_i$. Alternatively, $\lambda(S, a_i, l_1)$ is of the form $S$ followed by $l_1$
occurrences of character $a_i$.

For example, if $S$ = edebbe then $\lambda(S, d, 3)$ = edebbeddd, and
if $S$ = baacd then $\lambda(S, a, 2)$ = baacdaa.
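
The chain cover is simple to construct. A one-line sketch follows, where the function name `chain_cover` is an assumption introduced for illustration.

```python
def chain_cover(s, ch, l1):
    """Return the chain cover of s over l1 symbols of character ch."""
    return s + ch * l1

print(chain_cover("baacd", "a", 2))
```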

We first prove that for any string $S$ of length $l$, the $X^2$ value of any
string $S'$ of length less than or equal to $l + l_1$ and having $S$ as its
prefix is upper bounded by the $X^2$ value of a chain cover of $S$ over
$l_1$ symbols of some character $a_j \in \Sigma$.



$^2$ A suffix tree is a data structure that can be built in $\Theta(n)$ time. The
power of suffix trees lies in quickly finding a particular substring 
of the string. It provides a fast implementation of many important 
string operations. 



LEMMA 1. Let $S$ be any given string of length $l$ over $(\Sigma, P)$
with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$
and $\sum_{i=1}^{k} Y_i = l$. Let $S'$ be any string which has $S$ as its prefix
and is of length $l + l_1$. Then there exists some character $a_j \in \Sigma$
such that the $X^2$ value of $S'$ is upper bounded by the $X^2$ value of the
cover string $\lambda(S, a_j, l_1)$. The character $a_j$ is such that it has the
maximum value of $\frac{2Y_j + l_1}{p_j}$ among all $j \in \{1, 2, \ldots, k\}$.

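PROOF. Let the $X^2$ values of strings $S$, $S'$ and $\lambda(S, a_j, l_1)$ be
denoted by $X^2_S$, $X^2_{S'}$ and $X^2_\lambda$ respectively. We need to prove that
$X^2_{S'} \le X^2_\lambda$.

By definition, the count vector of $\lambda(S, a_j, l_1)$ is $\{Y_1, Y_2, \ldots, Y_j + l_1, \ldots, Y_k\}$.
Further, let $Y'_i$ denote the frequency of character $a_i$ in
$S'$ that is not present in $S$ (i.e., the frequency of $a_i$ in the $l_1$-length
suffix of $S'$). So, the count vector of $S'$ is $\{Y_1 + Y'_1, Y_2 + Y'_2, \ldots, Y_k + Y'_k\}$
where each $Y'_i \ge 0$ and $\sum_{i=1}^{k} Y'_i = l_1$. From the definition
of the $X^2$ statistic given in (5), we have

$$X^2_S = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (6)$$

$$X^2_\lambda = \sum_{m \ne j} \frac{Y_m^2}{(l + l_1) p_m} + \frac{(Y_j + l_1)^2}{(l + l_1) p_j} - (l + l_1) = \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \frac{2 Y_j l_1 + l_1^2}{(l + l_1) p_j} - (l + l_1) \qquad (7)$$

$$X^2_{S'} = \sum_{m=1}^{k} \frac{(Y_m + Y'_m)^2}{(l + l_1) p_m} - (l + l_1) = \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \sum_{m=1}^{k} \frac{2 Y_m Y'_m + Y'^2_m}{(l + l_1) p_m} - (l + l_1) \qquad (8)$$

The character $a_j$ is chosen such that it maximizes the quantity
$\frac{2Y_j + l_1}{p_j}$ over all characters of the alphabet. So for any other character $a_m$
where $m \in \{1, 2, \ldots, k\}$ we have

$$\frac{2Y_m + Y'_m}{p_m} \le \frac{2Y_m + l_1}{p_m} \le \frac{2Y_j + l_1}{p_j} \qquad (9)$$

Multiplying (9) by $Y'_m$ and summing it over $m$ we get

$$\sum_{m=1}^{k} \frac{2 Y_m Y'_m + Y'^2_m}{p_m} \le \sum_{m=1}^{k} Y'_m \frac{2Y_j + l_1}{p_j} = \frac{2 Y_j l_1 + l_1^2}{p_j} \qquad (10)$$

From (7), (8) and (10) we have

$$X^2_{S'} \le \sum_{m=1}^{k} \frac{Y_m^2}{(l + l_1) p_m} + \frac{2 Y_j l_1 + l_1^2}{(l + l_1) p_j} - (l + l_1) = X^2_\lambda.$$

□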



The next lemma states that the $X^2$ value of a string can always
be increased by adding a particular character to it.

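LEMMA 2. Let $S$ be any given string of length $l$ over $(\Sigma, P)$
with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$
and $\sum_{i=1}^{k} Y_i = l$. There always exists some character $a_j$ such that
by appending it to $S$, the $X^2$ value of the resultant string $S'$ becomes
greater than that of $S$. The character $a_j$ is such that it has the
maximum value of $\frac{Y_j}{p_j}$ among all $j \in \{1, 2, \ldots, k\}$.

PROOF. Let the $X^2$ values of strings $S$ and $S'$ be denoted by
$X^2_S$ and $X^2_{S'}$, respectively. We need to prove that $X^2_S < X^2_{S'}$.
The string $S'$ is the resultant string obtained by appending character
$a_j$ to the string $S$, so the count vector of $S'$ is $\{Y_1, \ldots, Y_j + 1, \ldots, Y_k\}$.
From (5), we have

$$X^2_S = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (11)$$

$$X^2_{S'} = \sum_{m=1}^{k} \frac{Y_m^2}{(l + 1) p_m} + \frac{2Y_j + 1}{(l + 1) p_j} - (l + 1) \qquad (12)$$

From (11) and (12) we have

$$X^2_{S'} - X^2_S = \sum_{m=1}^{k} \frac{Y_m^2}{(l + 1) p_m} - \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} + \frac{2Y_j + 1}{(l + 1) p_j} - 1 = \frac{1}{l(l + 1)} \left[ \frac{(2Y_j + 1)\, l}{p_j} - l(l + 1) - \sum_{m=1}^{k} \frac{Y_m^2}{p_m} \right] \qquad (13)$$

The character $a_j$ is chosen such that it maximizes $\frac{Y_j}{p_j}$ over all $j$. So
we have

$$\frac{Y_m}{p_m} \le \frac{Y_j}{p_j} \quad \forall m \in \{1, 2, \ldots, k\} \qquad (14)$$

Multiplying (14) by $Y_m$ and summing it over $m$ we get

$$\sum_{m=1}^{k} \frac{Y_m^2}{p_m} \le \sum_{m=1}^{k} Y_m \frac{Y_j}{p_j} = \frac{l Y_j}{p_j} \qquad (15)$$

Putting (15) into (13) we get

$$X^2_{S'} - X^2_S \ge \frac{1}{l(l + 1) p_j} \left[ (2Y_j + 1)\, l - l(l + 1) p_j - l Y_j \right] = \frac{1}{l(l + 1) p_j} \left[ l (Y_j - l p_j) + l (1 - p_j) \right] \qquad (16)$$

Again, from (14) we have:

$$Y_m p_j \le Y_j p_m \implies p_j \sum_{m=1}^{k} Y_m \le Y_j \sum_{m=1}^{k} p_m \implies l p_j \le Y_j \qquad (17)$$

Putting (17) into (16) and using $p_j < 1$, we get

$$X^2_{S'} - X^2_S > 0.$$

□

Algorithm 1 Algorithm for finding the most significant substring (MSS)

 1: $X^2_{max} \leftarrow 0$
 2: for $i = n$ to 1 do
 3:   for $l = 0$ to $n - i$ do
 4:     $l' \leftarrow i + l$
 5:     $X^2_l \leftarrow$ $X^2$ value of $S[i \ldots l']$
 6:     if $X^2_l > X^2_{max}$ then
 7:       $X^2_{max} \leftarrow X^2_l$
 8:     end if
 9:     $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
10:     $a \leftarrow 1 - p_t$
11:     $b \leftarrow 2Y_t - 2 l p_t - p_t X^2_{max}$
12:     $c \leftarrow (X^2_l - X^2_{max})\, l p_t$
13:     $x \leftarrow \left\lceil \frac{-b + \sqrt{b^2 - 4ac}}{2a} \right\rceil$
14:     Increment $l$ by $x$
15:   end for
16:   $i \leftarrow i - 1$
17: end for
18: return $X^2_{max}$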



In the next result, we show that the $X^2$ value of any string $S'$
having $S$ as its prefix is upper bounded by the $X^2$ value of the chain
cover of $S$.

THEOREM 1. Let $S$ be any given string of length $l$ over $(\Sigma, P)$
with count vector denoted by $\{Y_1, Y_2, \ldots, Y_k\}$ where each $Y_i \ge 0$
and $\sum_{i=1}^{k} Y_i = l$. Further, let $S'$ be any string which has $S$ as
its prefix and is of length less than or equal to $l + l_1$. Then there
exists some character $a_j \in \Sigma$ such that the $X^2$ value of $S'$ is less
than or equal to the $X^2$ value of $\lambda(S, a_j, l_1)$. The character $a_j$ is such
that it has the maximum value of $\frac{2Y_j + l_1}{p_j}$ among all $j \in \{1, 2, \ldots, k\}$.



PROOF. The proof follows directly from the results stated in
Lemma 1 and Lemma 2. From Lemma 2, we can say that there
always exists a character such that appending it increases the $X^2$
value of $S'$. Hence, we keep appending such characters to the string $S'$
till its length becomes $l + l_1$. We call the resultant string
$S_c$. Clearly, $S_c$ has $S$ as its prefix and is of length $l + l_1$, and the $X^2$
value of $S'$ is less than or equal to the $X^2$ value of $S_c$. The character
$a_j$ is such that it maximizes $\frac{2Y_j + l_1}{p_j}$ over all $j \in \{1, 2, \ldots, k\}$;
using Lemma 1, we can say that the $X^2$ value of $S_c$ is less than or equal to the
$X^2$ value of $\lambda(S, a_j, l_1)$. This further implies that the $X^2$ value of $S'$
is less than or equal to the $X^2$ value of $\lambda(S, a_j, l_1)$. □
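
The bound of Theorem 1 can be checked empirically on small inputs by enumerating every extension of a prefix. The sketch below does this by brute force; the helper names (`counts_of`, `chain_cover_bound`, `check_theorem1`) and the floating-point tolerance are assumptions made for illustration, and the enumeration is exponential in $l_1$, so it is only meant for tiny examples.

```python
from itertools import product

def chi_square(counts, probs):
    l = sum(counts)
    return sum(y * y / (l * p) for y, p in zip(counts, probs)) - l

def counts_of(s, alphabet):
    return [s.count(ch) for ch in alphabet]

def chain_cover_bound(s, alphabet, probs, l1):
    """X^2 of the chain cover over l1 symbols of the character maximizing (2*Y_j + l1) / p_j."""
    y = counts_of(s, alphabet)
    j = max(range(len(alphabet)), key=lambda m: (2 * y[m] + l1) / probs[m])
    return chi_square(counts_of(s + alphabet[j] * l1, alphabet), probs)

def check_theorem1(s, alphabet, probs, l1):
    """True if every string with prefix s and length <= len(s) + l1 respects the bound."""
    bound = chain_cover_bound(s, alphabet, probs, l1)
    for extra in range(l1 + 1):
        for suffix in product(alphabet, repeat=extra):
            s_prime = s + "".join(suffix)
            if chi_square(counts_of(s_prime, alphabet), probs) > bound + 1e-9:
                return False
    return True

print(check_theorem1("abba", "abc", [0.5, 0.3, 0.2], 4))
```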



We next formally describe our algorithm for finding the most 
significant substring (MSS). 



4. THE MSS ALGORITHM 

The algorithm looks for the possible candidates of MSS in an ordered
fashion. The pseudocode is shown in Algorithm 1. The loop
in line 2 iterates over the start positions of the substrings while the
loop in line 3 iterates over all the possible lengths of the substrings
from a particular start position. We keep track of the maximum $X^2$
value of any substring computed by our algorithm by storing it in
a variable $X^2_{max}$. For a given substring $S[i \ldots l']$, we calculate its
$X^2$ value, which is stored in $X^2_l$ (line 5). If $X^2_l$ turns out to be
greater than $X^2_{max}$, then $X^2_{max}$ is updated accordingly (line 7).

The character $a_t$ is chosen such that it maximizes the value of
$\frac{2Y_j + x}{p_j}$ over all $j$ (line 9 of the pseudocode). This property is
necessary for the application of the result stated in Theorem 1.
Denoting the $X^2$ value of a chain cover of $S[i \ldots l']$ over $x$ symbols
of character $a_t$ by $X^2_\lambda$, the result stated in Theorem 1 states that
the $X^2$ value of any substring of the form $S[i \ldots (l' + m)]$ for
$m \in \{0, 1, \ldots, x\}$ is upper bounded by $X^2_\lambda$. We choose $x$ such
that it is maximized within the constraint that $X^2_\lambda$ is guaranteed to
be less than or equal to $X^2_{max}$. Then, under the given constraint,
we can skip checking all substrings of the form $S[i \ldots (l' + m)]$ for
$m \in \{0, 1, \ldots, x\}$ as their $X^2$ values are not greater than $X^2_{max}$.
So, we directly increment $l$ by $x$ (line 14). Next, we find out what
the ideal choice of $x$ is.

We denote the count vector of substring $S[i \ldots l']$ of length $l$
by $\{Y_1, Y_2, \ldots, Y_k\}$. The count vector of the chain cover is given by
$\{Y_1, Y_2, \ldots, Y_t + x, \ldots, Y_k\}$ where $Y_t$ denotes the frequency of
character $a_t$ in the algorithm. By definition of $X^2$ from (5),

$$X^2_l = \sum_{m=1}^{k} \frac{Y_m^2}{l p_m} - l \qquad (18)$$

and

$$X^2_\lambda = \sum_{m=1}^{k} \frac{Y_m^2}{(l + x) p_m} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) = \frac{l (X^2_l + l)}{l + x} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) \qquad (19)$$

We want to maximize $x$ with the constraint that $X^2_\lambda \le X^2_{max}$.
From (19) we have,

$$\frac{l (X^2_l + l)}{l + x} + \frac{2 x Y_t + x^2}{(l + x) p_t} - (l + x) \le X^2_{max} \qquad (20)$$

On multiplying (20) by $(l + x) p_t$ and rearranging, the constraint
simplifies to

$$(1 - p_t) x^2 + (2Y_t - 2 l p_t - p_t X^2_{max})\, x + (X^2_l - X^2_{max})\, l p_t \le 0 \qquad (21)$$

Eq. (21) is a quadratic inequality in $x$ with $a = 1 - p_t > 0$,
$b = 2Y_t - 2 l p_t - p_t X^2_{max}$ and $c = (X^2_l - X^2_{max})\, l p_t \le 0$ (since $X^2_l \le X^2_{max}$).
We need to maximize $x$ with the constraint that $ax^2 + bx + c \le 0$.
Thus, we choose $x$ as the positive root of the quadratic
equation:

$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad (22)$$

Since $a > 0$ and $c \le 0$ we have $x \ge 0$. Further, since $x$ has to be
an integer, we choose $x$ as the smallest integer greater than or equal
to the above value (line 13 of the algorithm).
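
The skipping idea of this section can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: all function names are hypothetical, every $p_i$ is assumed to lie strictly between 0 and 1, and lengths start at 1. It also deviates from line 9 of the pseudocode in one respect: instead of selecting a single character $a_t$ (whose choice depends on $x$), it solves (21) for every character and takes the smallest positive root, so that no chain cover of that length can exceed the current maximum. A brute-force reference is included for comparison.

```python
import math

def chi_square(counts, probs):
    l = sum(counts)
    return sum(y * y / (l * p) for y, p in zip(counts, probs)) - l

def prefix_counts(s, alphabet):
    """rows[pos][j] = occurrences of alphabet[j] in s[:pos]."""
    index = {ch: j for j, ch in enumerate(alphabet)}
    rows = [[0] * len(alphabet)]
    for ch in s:
        row = rows[-1][:]
        row[index[ch]] += 1
        rows.append(row)
    return rows

def mss_bruteforce(s, alphabet, probs):
    """Reference answer: examine all O(n^2) substrings."""
    rows, n = prefix_counts(s, alphabet), len(s)
    return max(
        chi_square([b - a for a, b in zip(rows[i], rows[j])], probs)
        for i in range(n) for j in range(i + 1, n + 1)
    )

def mss_skip(s, alphabet, probs):
    """Return (maximum X^2, number of substrings examined)."""
    rows, n = prefix_counts(s, alphabet), len(s)
    best, examined = 0.0, 0
    for i in range(n - 1, -1, -1):          # start positions, right to left
        l = 1                               # current substring length
        while i + l <= n:
            counts = [b - a for a, b in zip(rows[i], rows[i + l])]
            x2 = chi_square(counts, probs)
            examined += 1
            best = max(best, x2)
            # Assumption of this sketch: smallest root of (21) over all characters.
            skip = math.inf
            for y, p in zip(counts, probs):
                a = 1.0 - p
                b = 2 * y - 2 * l * p - p * best
                c = (x2 - best) * l * p     # never positive, so the root is real
                skip = min(skip, (-b + math.sqrt(b * b - 4 * a * c)) / (2 * a))
            # Lengths l+1 .. l+floor(skip) cannot beat best; the small
            # epsilon guards against floating-point rounding near integers.
            l += max(1, math.floor(skip - 1e-9) + 1)
    return best, examined

s = "abcabcaaaaaaaabcbcbcbcbccccc"
print(mss_skip(s, "abc", [0.5, 0.3, 0.2]), mss_bruteforce(s, "abc", [0.5, 0.3, 0.2]))
```

Taking the minimum root over all characters is more conservative than the single-character rule, so it may skip fewer substrings, but the bound from Theorem 1 then holds regardless of which character attains the maximum.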

5. ANALYSIS OF THE MSS ALGORITHM 

We first show that the running time of the algorithm on an input
string generated from a memoryless Bernoulli model is $O(kn^{3/2})$
with high probability, where $n$ and $k$ denote the string and alphabet
size respectively. For a string not generated from the null model,
we will argue that the time taken by our algorithm on that string is
less than the time taken by our algorithm on an equivalent string
of the same size generated from the null model. Hence, the time
complexity of our algorithm for any input string is $O(kn^{3/2})$ with
high probability.

Let $S$ be any string drawn from a memoryless Bernoulli model.
Let $T_{ij}$ denote the random variable that takes value 1 if $a_i$ occurs
at position $S[j]$ and 0 otherwise. Each character of the string $S$
is independently drawn from a fixed probability distribution $P$, so
the probability that $T_{ij} = 1$ is $p_i$. The frequency of character $a_i$
in the string $S$, denoted by the random variable $Y_i$, is the sum of $n$
Bernoulli random variables $T_{ij}$ where $j$ ranges from 1 to $n$. Since
$Y_i$ is the sum of $n$ i.i.d. (independent and identically distributed)
Bernoulli random variables, each having a success probability $p_i$,
$Y_i$ follows a binomial distribution with parameters $n$ and $p_i$.

$$T_{ij} \sim \text{Bernoulli}(p_i), \quad Y_i = \sum_{j=1}^{n} T_{ij} \implies Y_i \sim \text{Binomial}(n, p_i) \qquad (23)$$

We state the following two standard results from the domain of 
probability distributions. 



THEOREM 2. For large values of $n$, the Binomial($n, p$) distribution
converges to the Normal($\mu, \sigma^2$) distribution with the same mean
and variance, i.e., $\mu = np$ and $\sigma^2 = np(1 - p)$.

PROOF. The proof uses the result of the Central Limit Theorem.
Please refer to [4] for the detailed proof.$^3$ □

It has been shown in [1] that for both $n$ and $np$ greater than
a constant$^4$, the binomial distribution can be approximated by the
normal distribution. Since all the probabilities $p_i$ in our setting
are fixed, we can always find a constant (say $c$) such that for all $n$
greater than $c$, every $Y_i$ follows the $N(np_i, np_i(1 - p_i))$ distribution. We
use the following result to obtain the distribution of the $X^2$ statistic
of any substring from a string generated using the null model.

THEOREM 3. Let the random variables $Y_i$, $i \in \{1, 2, \ldots, k\}$, follow
the $N(np_i, np_i(1 - p_i))$ distribution with $\sum_{i=1}^{k} p_i = 1$ and the
additional constraint that $\sum_{i=1}^{k} Y_i = n$. The random variable

$$X^2 = \sum_{i=1}^{k} \frac{(Y_i - np_i)^2}{np_i} \qquad (24)$$

then follows the chi-square distribution with $(k - 1)$ degrees of
freedom, denoted by $\chi^2(k - 1)$.

PROOF. It has to be noted that the $Y_i$'s in the theorem are not
independent but have an added constraint that $\sum_{i=1}^{k} Y_i = n$. This
is precisely the reason why the degrees of freedom of the chi-square
distribution is $k - 1$ instead of $k$. A well known result is that the
sum of squares of $k$ independent standard normal random variables
follows a $\chi^2(k)$ distribution. The proof (which is slightly complicated)
follows directly from this well known result. Please refer to
[20] for the detailed proof. □

We will next prove that with high probability, the $X^2$ value of
the MSS of $S$ generated using the null model is greater than $\ln n$.
However, before that, we prove another useful result using elementary
probability theory.

LEMMA 3. Let $Z_{max}$ denote the maximum of $m$ i.i.d. random
variables following the $\chi^2(k)$ distribution. Then with probability at
least $1 - O(1/m^2)$, for sufficiently large $m$ and for any constant
$c > 0$, $\ln cm < Z_{max}$.

PROOF. We first show this for $k = 2$. Let $f(x)$ and $F(x)$ denote
the pdf and cdf of the $\chi^2(2)$ distribution:

$$f(x; 2) = \frac{1}{2} e^{-x/2}, \qquad F(x; 2) = 1 - e^{-x/2} \qquad (25)$$

We have

$$Z_{max} = \max\{Z_1, Z_2, \ldots, Z_m\}, \quad \forall i,\ Z_i \sim \chi^2(k) \qquad (26)$$

For any constant $c > 0$ we have:

$$\Pr\{Z_{max} > \ln cm\} = \Pr\{\exists i \text{ s.t. } Z_i > \ln cm\} = 1 - \Pr\{\forall i,\ Z_i \le \ln cm\} = 1 - \left(\Pr\{Z_i \le \ln cm\}\right)^m$$

$$= 1 - \left(1 - e^{-\frac{\ln cm}{2}}\right)^m = 1 - \left(1 - \frac{1}{\sqrt{cm}}\right)^m \ge 1 - e^{-\sqrt{m/c}} \ge 1 - O(1/m^2) \qquad (27)$$

In the above proof we only utilized the asymptotic behavior of
the pdf and cdf of the $\chi^2(2)$ distribution. Since for any general $k$,
the asymptotic behavior of the pdf and cdf of the $\chi^2(k)$ distribution has
the same dominating term $e^{-x/2}$, the above result is valid for any
given $k$.$^5$ □

$^3$ In the above approximation, we can think of the binomial distribution
as the discrete version of the normal distribution having the
same mean and variance. So we do not need to account for the
approximation error using the Berry-Esseen theorem [8].
$^4$ In general, the value of this constant is taken as 5 [1].
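
For the $k = 2$ case used in the proof of Lemma 3, the failure probability has a closed form, so the inequality in (27) can be checked numerically. The sketch below assumes the closed-form cdf $F(x; 2) = 1 - e^{-x/2}$; the function names are hypothetical.

```python
import math

def prob_max_not_above(m, c):
    """Pr{Z_max <= ln(c*m)} for m i.i.d. chi-square(2) variables."""
    threshold = math.log(c * m)
    return (1.0 - math.exp(-threshold / 2.0)) ** m

def exponential_bound(m, c):
    """Upper bound exp(-sqrt(m/c)) on the failure probability, as in (27)."""
    return math.exp(-math.sqrt(m / c))

for m in (10, 100, 1000):
    print(m, prob_max_not_above(m, 1.0), exponential_bound(m, 1.0))
```

The bound follows from $1 - u \le e^{-u}$ with $u = 1/\sqrt{cm}$, so the failure probability shrinks faster than any polynomial in $m$.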

LEMMA 4. In the MSS algorithm, at any iteration in the loop 
over i, X^ ax > In n' with probability at least 1 — O ( 1 /n' 2 ) where 
n' — n — i. 

PROOF. We can verify from the pseudo code (Algorithm 1) that 
before we begin the loop in line 2 for i = io, we have checked 
all the substrings that are potential candidates for MSS of S start- 
ing at i > io. So, at this instance, the variable X^ ax stores the 
maximum X value of any substring of the string S[(io + 1) . . . n] . 



independent Bernoulli random variables Tij each with expec- 
tation p t ; so, E[Y t ] = lp t . Also, we have Pr(T ij G [0, 1]) = 
1. Now, using the Hoeffding's inequality [16], we get 

2t 2 l 2 

Pr{Y t - E[Y t ] <t}>l-e ^=i (i --'' )2 (29) 

Substituting E[Y t ] = lpt, t = \ yjlpt In /, a, = and h — 1, 
we have for any constant £2 > 



I / 2! Pi 

Pr{Y t - lp t < -s/lpt InZ} > 1 - e r 



2lp t In I 
61 



= l-is- >l-e 2 (30) 



n C2 — 0(n' 2 ) (where n' — n—io) random variables each follow- 
ing the same x (fc - 1) distribution. However, since these Oiri ) 
substrings are not mutually independent, the result of Lemma 3 
cannot be directly applied in this case. 

However, we can still say that a subset of at least O(n') sub- 
strings are independent, with each substring following a/ffc- 1) 
distribution. One way of constructing a mutually independent sub- 
set of size O(n') is by choosing n'/c substrings each of length c 
such that they do not share any character among them, i.e., the i' h 
substring in this set is S[(ci — c) . . . (ci — 1))] where c is a constant 
such that the binomial distribution can be approximated by the nor- 
mal distribution for all strings of length greater than or equal to c. 
Since all characters of the string S are drawn independently from 
a fixed probability distribution, all the substrings in the subset are 
mutually independent, and since length of all these substrings are 
greater than c, X statistics of these substrings follow the \ {k— 1) 
distribution. Consequently, the value of X^ ax in our algorithm is 
greater than the max of at least O(n') y/ 2 (fc — 1) i-i.d. random vari- 
ables. Putting the value of m = n'/c in the result of Lemma 3, we 
can prove the above result. □ 

LEMMA 5. On an input string generated from the null model, 
with high probability (> 1 — efor any constant e > 0) the number 
of substrings skipped (denoted by x) in any iteration of the loop on 
I in the MSS algorithm is uj{yl) for sufficiently large values of I. 
Hence, e can be set so close to that with probability practically 
equal to 1, the number of substrings skipped x in any iteration is at 
least 

PROOF. As stated in (22), the number of substrings skipped in 
any iteration of the loop on I is 



-b + Vb 2 - 4ac 
2o 



(28) 



We will prove that in the string generated from the null model, 
with high probability b < \ y/lpt In I and c > — \lpt In/. These 
bounds help us in guaranteeing that x = u)(Vl) with high proba- 
bility. In order to prove the bounds on b and c, we first prove that 
the following conditions hold with high probability. 

(i) From the result stated in Lemma 4, for any constant $\epsilon_1 > 0$,
we have with probability at least $1 - O(1/n'^2) > 1 - \epsilon_1$ that
$X^2_{max} > \ln n'$, where $n' = n - i$. In the algorithm, $l$ in the
loop iterates from 0 to $n - i$, so we have $l \le n'$. Hence,
$X^2_{max} > \ln l$ with probability at least $1 - \epsilon_1$.

(ii) Suppose $Y_t$ denotes the frequency of character $a_t$ in the string
$S[i \ldots l']$ of length $l$. As denoted in (23), it is the sum of $l$

5 The term $x^{k/2-1} e^{-x/2}$ occurring in the pdf of a general $k$ is
asymptotically less than $e^{-x/2+\epsilon}$ and greater than $e^{-x/2-\epsilon}$ for any
$\epsilon > 0$, which is independent of $k$.



(iii) As stated in Theorem 3, the $X^2$ value of the substring $S[i \ldots l']$
of length $l$, denoted by $X_l^2$, follows the $\chi^2$ distribution. Further,
using the definition of the cdf of the $\chi^2$ distribution, denoted by
$F_\chi$, we have for any constant $\epsilon_3 > 0$

$$\Pr\{X_l^2 \le \tfrac{1}{2}\ln l\} = F_\chi(\tfrac{1}{2}\ln l) = 1 - e^{-\frac{1}{4}\ln l} > 1 - \epsilon_3 \qquad (31)$$

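We choose constants $\epsilon_1$, $\epsilon_2$ and $\epsilon_3$ small enough such that for any
constant $\epsilon > 0$, $1 - \epsilon_1 - \epsilon_2 - \epsilon_3 > 1 - \epsilon$. Thus, combining the
above three conditions, the following results hold with probability
$1 - \epsilon$:

$$b = 2(Y_t - l p_t) - p_t X^2_{max} < 2(Y_t - l p_t) < \tfrac{1}{2}\sqrt{l p_t \ln l} \qquad (32)$$

$$c = l p_t (X_l^2 - X^2_{max}) < l p_t (\tfrac{1}{2}\ln l - \ln l) = -\tfrac{1}{2} l p_t \ln l \qquad (33)$$

$$a = 1 - p_t < 1 \qquad (34)$$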



We use the fact that if any positive $x$ satisfies the inequality
$a'x^2 + b'x + c' < 0$ then it also satisfies the inequality $ax^2 + bx + c < 0$
if $a < a'$, $b < b'$ and $c < c'$. So, substituting the upper bounds of $a$,
$b$ and $c$ in (28) and maximizing $x$ in (28), we have with probability
$1 - \epsilon$

$$x \ge \tfrac{1}{2}\left(\sqrt{\tfrac{1}{4} l p_t \ln l + 2 l p_t \ln l} - \tfrac{1}{2}\sqrt{l p_t \ln l}\right) = \tfrac{1}{2}\sqrt{l p_t \ln l} = \Omega(\sqrt{l \ln l}) = \omega(\sqrt{l}) \qquad (35)$$

□



Further, in Algorithm 1, except line 9, all the steps inside the
loop over $l$ in line 3 can be performed in constant time. However,
if we can determine the frequency of any character in the
substring $S[i \ldots l']$ in $O(1)$ time, then we can find the character
$a_t$ (line 9) in $O(k)$ time. For this purpose, we maintain one count
array for each character $a_t$, $t = 1, \ldots, k$, where the $i$-th element
of the count array stores the number of occurrences of $a_t$ up to the
$i$-th position in the string. Each count array can be preprocessed in
$O(n)$ time. Consequently, each iteration of the loop over $l$ in line 3
takes $O(k)$ time. Further, the loop over $i$ in line 2 iterates $n$ times.
Now, we only need to compute the number of iterations of the loop
over $l$, for which we use the next lemma.
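The count arrays described above can be sketched as follows. This is only an illustration, not the authors' C implementation: the function names and the encoding of characters as integers 0..k-1 are assumptions made for the example.

```python
def build_count_arrays(s, k):
    """counts[m][j] = number of occurrences of character m in s[0:j]."""
    counts = [[0] * (len(s) + 1) for _ in range(k)]
    for j, ch in enumerate(s):
        for m in range(k):
            counts[m][j + 1] = counts[m][j] + (1 if ch == m else 0)
    return counts


def substring_frequencies(counts, start, end):
    """Frequencies of all k characters in s[start:end]: one subtraction each."""
    return [row[end] - row[start] for row in counts]


s = [0, 1, 1, 0, 2, 1]                       # hypothetical string over k = 3 characters
counts = build_count_arrays(s, k=3)
print(substring_frequencies(counts, 1, 5))   # frequencies in s[1:5] -> [1, 2, 1]
```

Each array is filled in a single pass over the string, and any substring's frequency vector then costs $O(k)$ regardless of its length.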

LEMMA 6. The expected number of iterations of the loop on $l$
(in line 3 of the MSS algorithm) for each value of $i$ is $O(\sqrt{n})$.
Algorithm 2 Algorithm for finding the top-t substrings
  $T \leftarrow$ Min Heap on $t$ elements all initialized to 0
  for $i = n$ to 1 do
    for $l = 0$ to $n - i$ do
      $l' \leftarrow i + l$
      $X^2_{max_t} \leftarrow$ Find Min($T$)
      $X_l^2 \leftarrow X^2$ value of $S[i \ldots l']$
      if $X_l^2 > X^2_{max_t}$ then
        Extract Min($T$)
        Insert $X_l^2$ in $T$
      end if
      $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
      $a \leftarrow 1 - p_t$
      $b \leftarrow 2Y_t - 2lp_t - p_t X^2_{max_t}$
      $c \leftarrow (X_l^2 - X^2_{max_t}) l p_t$
      $x \leftarrow \frac{-b + \sqrt{b^2 - 4ac}}{2a}$
      Increment $l$ by $x$
    end for
    $i \leftarrow i - 1$
  end for
  return $T$



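Algorithm 3 Algorithm for finding all substrings having $X^2$ value
greater than $\alpha_0$
  $S_{\alpha_0} \leftarrow \emptyset$
  for $i = n$ to 1 do
    for $l = 0$ to $n - i$ do
      $l' \leftarrow i + l$
      $X_l^2 \leftarrow X^2$ value of $S[i \ldots l']$
      if $X_l^2 > \alpha_0$ then
        $S_{\alpha_0} \leftarrow S_{\alpha_0} \cup S[i \ldots l']$
      end if
      $t \leftarrow m$ s.t. $\forall m \in \{1, 2, \ldots, k\}$, $\frac{2Y_m + x}{p_m}$ is maximum
      $a \leftarrow 1 - p_t$
      $b \leftarrow 2Y_t - 2lp_t - p_t \alpha_0$
      $c \leftarrow (X_l^2 - \alpha_0) l p_t$
      $x \leftarrow \frac{-b + \sqrt{b^2 - 4ac}}{2a}$
      Increment $l$ by $x$
    end for
    $i \leftarrow i - 1$
  end for
  return $S_{\alpha_0}$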



PROOF. Let $T(r)$ be the number of iterations of the loop over $l$
required for $l$ to reach $r$. We have shown in Lemma 5 that in each
iteration, the number of substrings skipped $x$ is $\omega(\sqrt{l})$. Thus, $l$ in
the next iteration will move from $r$ to $r + \omega(\sqrt{r})$. This gives us the
following recursive relation:

$$T(r + c\sqrt{r}) \le T(r) + O(1) = T(r) + q \qquad (36)$$

It can be shown that the solution to the above relation is $O(\sqrt{n})$.
Please refer to Lemma 7 in the appendix for the detailed proof. □

Since the loop over $l$ in line 3 runs for $O(\sqrt{n})$ iterations for each
of the $n$ values of $i$, and each iteration takes $O(k)$ time, the time
taken by the algorithm on an input string generated
by the null model is $O(kn^{3/2})$, which is $O(n^{3/2})$ since $k$ is taken
as a constant in our problem setting. Thus, we have shown that the
running time of the algorithm on an input string generated from a
memoryless Bernoulli model is $O(kn^{3/2})$ with high probability.
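The skipping idea analyzed in this section can be illustrated with a small sketch. It is not a transcription of Algorithm 1: as a conservative simplification (an assumption of this example), it solves the quadratic with coefficients $a$, $b$, $c$ for every character and skips only by the smallest root, rather than selecting a single character $a_t$. Function names and the integer encoding of characters are likewise hypothetical, and all $p_m$ are assumed to lie strictly between 0 and 1.

```python
import math
import random


def chi_square(counts, length, probs):
    """Pearson X^2 of a substring, written as sum(Y_m^2 / (l p_m)) - l."""
    return sum(y * y / (length * p) for y, p in zip(counts, probs)) - length


def mss_brute_force(s, probs):
    """Reference scan over all O(n^2) substrings."""
    best = 0.0
    for i in range(len(s)):
        counts = [0] * len(probs)
        for j in range(i, len(s)):
            counts[s[j]] += 1
            best = max(best, chi_square(counts, j - i + 1, probs))
    return best


def mss_with_skips(s, probs):
    """Return (maximum X^2, number of substrings actually evaluated)."""
    n, k = len(s), len(probs)
    prefix = [[0] * (n + 1) for _ in range(k)]   # prefix[m][j]: count of m in s[0:j]
    for j, ch in enumerate(s):
        for m in range(k):
            prefix[m][j + 1] = prefix[m][j] + (1 if ch == m else 0)
    best, evaluated = 0.0, 0
    for i in range(n - 1, -1, -1):               # start positions, right to left
        l = 1
        while i + l <= n:
            counts = [prefix[m][i + l] - prefix[m][i] for m in range(k)]
            x2 = chi_square(counts, l, probs)
            evaluated += 1
            best = max(best, x2)
            skip = math.inf
            for y, p in zip(counts, probs):
                a = 1.0 - p
                b = 2.0 * y - 2.0 * l * p - p * best
                c = (x2 - best) * l * p          # never positive, since best >= x2
                disc = max(b * b - 4.0 * a * c, 0.0)
                skip = min(skip, (-b + math.sqrt(disc)) / (2.0 * a))
            # Extensions by 1..floor(skip) characters cannot exceed `best`.
            l += max(0, math.floor(skip - 1e-9)) + 1
    return best, evaluated


random.seed(7)
probs = [0.5, 0.3, 0.2]
s = random.choices(range(3), weights=probs, k=200)
print(mss_with_skips(s, probs), mss_brute_force(s, probs))
```

Because the sketch takes the minimum root over all characters, it may skip fewer substrings than the algorithm in the paper, so its iteration counts should not be read as a reproduction of the reported results.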

5.1 Nature of the String 

As can be verified from the definition, the $X^2$ value of a substring
increases when the expected and observed frequencies begin
to diverge. Thus, the individual substrings of a string not generated
from the null model are expected to have higher $X^2$ values
which, in turn, increases $X^2_{max}$. Further, it can be verified
from (22) that the number of substrings skipped, $x$, increases on
increasing $X^2_{max}$, as we have to maximize $x$ such that the constraint
$X^2 \le X^2_{max}$ is satisfied. If $X^2_{max}$ is large, it gives a larger window
for $X^2$, which allows the choice of a larger $x$. Hence, the time
taken by our algorithm on an input string not generated from the null
model is less than the time taken by our algorithm on an equivalent
string of the same size generated from the null model. So, the time
complexity of our algorithm remains $O(n^{3/2})$ and is independent
of the nature of the input string. Section 7.1.2 gives the details on
how our algorithms perform on different types of strings.

6. OTHER VARIANTS OF THE PROBLEM 
6.1 Top-t Substrings 

The algorithm for finding the top-t statistically significant substrings
(Algorithm 2) is the same as the algorithm for finding the MSS
except that $X^2_{max_t}$ stores the $t$-th largest $X^2$ value among all
substrings seen till that particular instant by the algorithm. We maintain
a min-heap $T$ of size $t$ for storing the top-t $X^2$ values seen by
the algorithm. The heap $T$ is initially empty and $X^2_{max_t}$ always
stores the top (minimum) element of the heap. If $X_l^2$ is computed
to be greater than $X^2_{max_t}$, then we extract the minimum element of
$T$ (which is no longer a part of the top-t substrings) and insert the
new $X_l^2$ value into the heap. Now, $X^2_{max_t}$ points to the new minimum
of the heap. Finally, at the end of the algorithm we return the
heap $T$, which contains the top-t $X^2$ values among all the substrings
of string $S$.

The analysis of this algorithm is the same as that of the algorithm for MSS
except that we now need to show that $X^2_{max_t}$ is greater than $\ln n$
with probability greater than any constant. This still holds true for
any $t < \omega(n)$ (please refer to Lemma 8 in the appendix for the detailed
proof). Moreover, inside the for loop on $l$, we now perform
insertion and extract-min operations on a heap $T$ of size $t$; so each
iteration of the loop over $l$ now requires $O(k + \log t)$ time. Thus,
the total time complexity of the algorithm for finding the top-t substrings
is $O((k + \log t)n^{3/2})$ for $t < \omega(n)$.
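The heap bookkeeping described above can be sketched with Python's standard `heapq` module. The function name and the sample values are illustrative assumptions; only the maintenance of the top-t values is shown, not the skipping logic.

```python
import heapq


def top_t_values(values, t):
    """Keep the t largest values seen so far in a min-heap of size t."""
    heap = [0.0] * t                     # mirrors "all initialized to 0" in Algorithm 2
    heapq.heapify(heap)
    for v in values:
        if v > heap[0]:                  # heap[0] plays the role of X^2_{max_t}
            heapq.heapreplace(heap, v)   # extract-min followed by insert, O(log t)
    return sorted(heap, reverse=True)


print(top_t_values([3.2, 0.5, 7.1, 4.4, 1.0, 6.3], 3))   # [7.1, 6.3, 4.4]
```

The root of the heap is always the current threshold that a new substring must beat, which is why each update costs $O(\log t)$.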

6.2 Significance Greater Than a Threshold 

The algorithm for finding all substrings having $X^2$ value greater
than a threshold $\alpha_0$ (Algorithm 3) is again essentially the same as
the MSS algorithm except that $X^2_{max}$ constantly remains $\alpha_0$ at
every iteration. We maintain $S_{\alpha_0}$ as a set of all substrings having
$X^2$ value greater than $\alpha_0$. We skip all substrings that cannot be
a part of $S_{\alpha_0}$, i.e., whose cover strings have $X^2$ value not greater
than $\alpha_0$.

Next, we analyze the time complexity of the algorithm on varying
$\alpha_0$. We again revert to (22):

$$x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \qquad (37)$$

where $a = 1 - p_t > 0$, $b = 2Y_t - 2lp_t - p_t\alpha_0$ and $c = (X_l^2 - \alpha_0) l p_t$,
which is negative only when $\alpha_0 > X_l^2$. If $\alpha_0 < X_l^2$, then $c$ in the above
equation is positive. Consequently, as $x$ takes the value 1, the number of
iterations of the loop on $l$ is $O(n)$. Hence, the time complexity of the algorithm
is $O(kn^2)$. However, the time complexity decreases sharply on
increasing $\alpha_0$. Once $\alpha_0$ becomes sufficiently greater than $X_l^2$, the
term $c \approx -\alpha_0 l p_t$ starts predominating $b$, and $x$ in each step is
effectively $\sqrt{-c/a}$, which is $O(\sqrt{\alpha_0 l})$. 6 Hence, the recurrence relation
of the number of iterations of the loop on $l$ in this case is

$$T(l + O(\sqrt{\alpha_0 l})) = T(l) + 1 \qquad (38)$$

It can again be shown with the help of Lemma 7 in the appendix
that the solution to the recursive relation is $O(\sqrt{l/\alpha_0})$. So the total
time complexity of the algorithm is $O(kn\sqrt{n/\alpha_0})$.

Figure 1: Analysis of time complexity for finding the MSS.
(a) Number of iterations with string length $n$ ($k = 2$).
(b) Number of iterations with alphabet size $k$.

6.3 MSS Greater Than a Given Length 

The algorithm for finding the most significant substring among
all substrings having length greater than a given length $\Gamma_0$ is exactly
the same as the MSS algorithm except that now we ignore
any substring whose length is not greater than $\Gamma_0$. This means the
loop on $l$ starts with $\Gamma_0$ instead of 0 and the loop on $i$ goes on till $n - \Gamma_0$
instead of $n$. The time complexity of the algorithm decreases not
just because fewer substrings are evaluated in this case but
also because the skip $x$ in our algorithm is a function of $l$ and it
increases with increasing values of $l$. Hence, the recursive relation
for the loop over $l$ in this case is the same with only the base case
different: $T(\Gamma_0) = 1$ instead of $T(1) = 1$. The solution to this
recurrence relation is $O(\sqrt{n} - \sqrt{\Gamma_0})$. Since there are $n - \Gamma_0$
iterations of the loop on $i$, the total time complexity of the algorithm is
$O(k(n - \Gamma_0)(\sqrt{n} - \sqrt{\Gamma_0}))$, which is effectively $O(kn^{3/2})$.



6 In a substring generated from a memoryless Bernoulli distribution,
$X^2$ follows a $\chi^2$ distribution with constant mean and variance.
Hence, it can be shown with high probability that $X_l^2$ is a
small constant.

Figure 2: Variation of $X^2_{max}$ with string length $n$ ($k = 2$); x-axis: $\ln n$.

Figure 3: $X^2_{max}$ and number of iterations for different multinomial
strings. $S_1$: $n = 10^4$, $k = 3$, $P = \{p_0, 0.5 - p_0, 0.5\}$; $S_2$: $n = 10^4$,
$k = 5$, $P = \{p_0, 0.5 - p_0, 0.1, 0.2, 0.2\}$.

7. EXPERIMENTAL ANALYSES AND AP- 
PLICATIONS 

The experimental results shown in this section are for C codes 
run on Macintosh platform on a machine with 2.3 GHz Intel dual 
core processor and 4 GB, 1333 MHz RAM. Each character of a syn- 
thetic string was generated independently from the underlying dis- 
tribution assumed using the standard uniform (0, 1) random num- 
ber generator in C. 

7.1 Synthetic Datasets 

7.1.1 Time Complexity of Finding MSS 

The first experiment is on the time complexity of our algorithm
for finding the most significant substring. Figure 1a depicts the
comparison of the number of iterations required by our algorithm
vis-a-vis the trivial algorithm for input strings of different lengths ($n$)
generated from the null model for an alphabet of size 2. The number
of iterations of our algorithm, when plotted on a logarithmic
scale, increases linearly with the logarithm of the string size with
a slope close to 1.5. Hence, we can claim that the empirical time
complexity of our algorithm for an input string generated by the null
model is also $O(n^{1.5})$.

The effect of varying the alphabet size is shown in Figure 1b
for different string lengths. It can be observed that, as expected,
varying the alphabet size has no significant effect on the number of
iterations of the algorithm.

Figure 2 shows that the expected $X^2_{max}$ increases linearly with $\ln n$
with slope ≈ 2, which supports our claim in Lemma 4 that for sufficiently
large $n$, $X^2_{max}$ is greater than $\ln n$ with high probability.

Finally, Figure 3 plots the variation of $X^2_{max}$ and iterations of the
loop over $l$ for different heterogeneous multinomial distributions
and different alphabet sizes. It is evident that a change in the probability
$p_0$ of occurrence of character $a_0$ only changes the $X^2_{max}$
but has no significant effect on the number of iterations taken by
our algorithm. It can be intuitively seen that the change in $p_0$ is
effectively canceled out by the change in $X^2_{max}$, so the number of
characters skipped ($x$ in Eq. (22)) roughly remains the same.

Figure 4: Comparison of time taken by our algorithm on strings not
generated by the null model. (a) Varying $n$ ($k = 5$); x-axis: string length.
(b) Varying $k$ ($n = 20000$); x-axis: alphabet size.

7.1.2 Strings Not Generated Using the Null Model

We now investigate the results for input strings not generated 
from the null model in addition to an equivalent length input string 
generated from the null model which is a memoryless Bernoulli 
source where the multinomial probabilities of all the characters are 
equal. The different types of other strings that we compare are: 

(a) Geometric string: A string generated from a memoryless multi- 
nomial Bernoulli source but the multinomial probabilities of all 
the characters are different. The probability of occurrence of a 
character decreases geometrically. Hence, the probability of
occurrence of character $a_i$ is proportional to $1/2^i$.

(b) Harmonic string: A string generated from a memoryless multi- 
nomial Bernoulli source but the multinomial probabilities of all 
the characters are different. The probability of occurrence of a 
character decreases harmonically. Hence, the probability of 
occurrence of character ai is proportional to 1/i. 

(c) Markov string: A string generated by a Markov process, i.e.,
the occurrence of a character depends on the previous character.
The state transition probability of character $a_j$ following
character $a_i$ is proportional to $1/2^{(i-j) \bmod k}$.
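The three string types above can be generated along the following lines. This is a hedged sketch rather than the authors' C generator: the function names are hypothetical, characters are encoded as integers 0..k-1, and the Markov string is assumed to start from a uniformly chosen character, which the text does not specify.

```python
import random


def geometric_probs(k):
    """Probability of character a_i proportional to 1/2^i, i = 1..k."""
    weights = [1.0 / 2 ** i for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def harmonic_probs(k):
    """Probability of character a_i proportional to 1/i, i = 1..k."""
    weights = [1.0 / i for i in range(1, k + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def memoryless_string(n, probs, rng):
    """Draw n characters independently from the given distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=n)


def markov_string(n, k, rng):
    """Transition a_i -> a_j has weight 1/2^((i - j) mod k)."""
    rows = [[1.0 / 2 ** ((i - j) % k) for j in range(k)] for i in range(k)]
    s = [rng.randrange(k)]               # assumption: uniform first character
    while len(s) < n:
        s.append(rng.choices(range(k), weights=rows[s[-1]], k=1)[0])
    return s


rng = random.Random(42)
print(geometric_probs(3))
print(memoryless_string(10, harmonic_probs(3), rng))
print(markov_string(10, 3, rng))
```

Under this transition rule a character is most likely to repeat itself, since $(i - j) \bmod k = 0$ gives the largest weight.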

The number of iterations for our algorithm on different values
of string length ($n$) and alphabet size ($k$) are plotted in Figure 4.
It can be verified that in all the cases, the string generated using
the null model requires the maximum number of iterations, which
is in accordance with our theoretical claim in Section 5. The time
taken by our algorithm on an input string not generated from a null
model is upper bounded by the time taken on an equivalent size
input string generated from the null model. This verifies that the
time complexity of our algorithm is $O(kn^{3/2})$, independent of the
type of the input string.

Figure 5: Analysis of time complexity for finding the top-t set.
(a) Number of iterations with string length $n$.
(b) Number of iterations with $t$; x-axis: $\ln t$.

7.2 Other Variants 

7.2.1 Top-t Significant Substrings

The time taken by the algorithm for finding the top-t set on varying
string lengths for different values of $t$ is shown in Figure 5a.
The linear increment in logarithmic scale with slope ≈ 1.5 verifies
that for any constant $t$ the time taken by our algorithm to find the
top-t set is again $O((k + \log t)n^{1.5})$.

The time taken for different $t$ is shown in Figure 5b. The plot
shows that till $t < \omega(n)$, the running time increases with slope 1.5,
but once $t$ crosses the limit, the slope starts increasing towards 2.
This is in agreement with our theoretical analysis in Section 6.1.

7.2.2 Significance Greater Than a Threshold 
Figure 6 depicts the number of iterations taken by the algorithm
for finding all substrings with $X^2$ value greater than a threshold $\alpha_0$.
As discussed in Section 6.2, the iterations decrease very sharply from $O(n^2)$
until $\alpha_0 = O(X^2_{max})$, after which they decrease gradually (as a function
of $1/\sqrt{\alpha_0}$).

7.2.3 Substrings Greater Than a Given Length 

The number of iterations taken by the algorithm for finding the
MSS among all substrings of length greater than $\Gamma_0$ is shown in
Figure 7. As discussed in Section 6.3, the number of iterations slowly
decreases as $\Gamma_0$ tends to $n$ before rapidly approaching 0.

Figure 6: Number of iterations with $\alpha_0$ (n = 10°, $k = 2$).

Figure 7: Number of iterations with $\Gamma_0$ ($n = 10^5$, $k = 2$).

7.3 Comparison with Existing Techniques 

Table 1 presents the comparative results of our algorithm with
the existing algorithms [13] for two different values of string size
(averaged over different runs). As expected, results indicate that
ARLM [13], being $O(n^2)$, does not scale well for larger strings,
as opposed to our algorithm. AGMM [13], being $O(n)$ time, is
very fast and outperforms all the algorithms in terms of time taken.
However, being just a heuristic with no theoretical guarantee, it
does not always lead to a solution that is close to the optimal. As
can be verified from Table 1, the average $X^2_{max}$ value found by
AGMM is significantly lower than the average $X^2_{max}$ value found
by the other algorithms. Further, since there are no guarantees on the
lower bound of the $X^2_{max}$ value found by it relative to the optimal
$X^2_{max}$ value, AGMM can lead to pretty bad solutions in some
real datasets which are not as well behaved as the synthetic ones
(Section 7.5). Finally, our algorithm requires only about 3 seconds for a
string as large as 80000 characters, which indicates that the algorithm
is practical for real-life scenarios.

7.4 Application in Cryptology 

The correlation between adjacent symbols is of central impor- 
tance in many cryptology applications [12]. The objective of a ran- 
dom number generator is to draw symbols from the null model. The 
independence of consecutive symbols is an important criterion for 
efficiency of a random number generator [12]. We define correla- 
tion between adjacent symbols in terms of the state transition prob- 
ability. An ideal random binary string generator should generate 
the same symbol in the next step with probability exactly 0.5. However,
some random number generators which are inefficient might
be biased towards generating the same symbol again with probability
more than 0.5. Table 2 shows the comparison of $X^2_{max}$ for
different lengths $n$ of the string and different probabilities $p$ of
generating the same symbol in the next iteration.



Algo      String Size   Avg $X^2_{max}$   Avg Time
Trivial   20000         18.69             8.54s
Our       20000         18.69             0.5s
ARLM      20000         18.69             1.9s
AGMM      20000         15.10             0.01s
Trivial   80000         20.35             142.21s
Our       80000         20.35             2.82s
ARLM      80000         20.32             39.22s
AGMM      80000         17.71             0.03s

Table 1: Comparison with other techniques for synthetic datasets.



$X^2_{max}$   p = 0.50   p = 0.55   p = 0.60   p = 0.80
n = 1000      12.18      14.24      16.80      36.47
n = 5000      15.12      17.67      21.52      48.79
n = 10000     16.87      19.36      24.03      53.37
n = 20000     17.89      21.48      25.70      60.61

Table 2: Variation of $X^2_{max}$ with $n$ and $p$.

It can be verified from the data that $X^2_{max}$ is minimum for
a string generated with $p = 0.5$ and increases with increasing $p$.
Further, Figure 2 plots the variation of $X^2_{max}$ of a string generated
using the null model with (the logarithm of) the string length ($\ln n$).
We observe a nice linear convergence with slope 2. This $X^2_{max}$
value can be used as a benchmark for a string of any length to measure
its randomness. If the $X^2_{max}$ value
of a string deviates significantly from the benchmark, it means that
the string generated is not completely random but contains some
kind of hidden correlation among the symbols. One of the major
advantages of using the algorithm is in a scenario where only a
substring of a string might deviate from the random behavior. Our
algorithm will be able to capture such a substring without having
to examine all the possible substrings 7 .
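The experiment behind Table 2 can be imitated with the sketch below. It is illustrative only: the generator name, the uniform first bit, and the use of a plain $O(n^2)$ scan instead of the paper's algorithm are assumptions of this example, and individual runs will not reproduce the averaged figures in the table.

```python
import random


def persistent_bits(n, p_same, rng):
    """Binary string in which each bit repeats the previous one with probability p_same."""
    bits = [rng.randrange(2)]            # assumption: uniform first bit
    while len(bits) < n:
        same = rng.random() < p_same
        bits.append(bits[-1] if same else 1 - bits[-1])
    return bits


def x2_max_binary(bits, p_one=0.5):
    """Brute-force maximum X^2 over all substrings of a binary string."""
    best = 0.0
    for i in range(len(bits)):
        ones = 0
        for j in range(i, len(bits)):
            ones += bits[j]
            length = j - i + 1
            expected = length * p_one
            # For k = 2 the two X^2 terms combine to (Y - lp)^2 / (l p (1 - p)).
            best = max(best, (ones - expected) ** 2 / (expected * (1.0 - p_one)))
    return best


rng = random.Random(0)
for p_same in (0.5, 0.6, 0.8):
    print(p_same, round(x2_max_binary(persistent_bits(500, p_same, rng)), 2))
```

A run of $L$ identical bits has $X^2 = L$ under the null model with $p = 0.5$, which is why a bias towards repetition tends to inflate $X^2_{max}$.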

7.5 Real Datasets 

7.5.1 Analysis of Sports Data

The chi-square statistic can be used to find the best or worst 
career patches of sports teams or professionals. Boston Red Sox 
versus New York Yankees is one of the most famous and fiercest 
rivalries in professional sports [11]. They have competed against 
each other in over two thousand Major League Baseball games 
over a period of 100 years. Yankees have won 1132 (54.27%) of 
those games. However, we would like to analyze the time peri- 
ods in which either of Yankees or Red Sox were particularly dom- 
inant against the other. The dominant periods should have large 
win ratio for a team over a sufficiently long stretch of games. If 
we encode the results in the form of a binary string whose letters 
denote a win or loss for a team, then these sufficiently long peri- 
ods will contain results that significantly differ from the expected 
or average. Consequently, the X 2 value for the dominant periods 
will significantly differ from 0. We use the dataset obtained from 
www.baseball-reference.com. 

The top five most significant patches found by our algorithm 
have been summarized in Table 3. The best period for Yankees 
was from mid 1920s to early 1930s in which they won more than 
75% of the games. It was clearly the era of Yankees dominance in 
which they won 26 World Series championships and 39 pennants, 
compared to only 4 pennants for the Red Sox [11]. Alternatively, 
the best patch for Red Sox was a two-year period around 1912 in 
which they had close to 90% winning record; this is also referred 
to as the glory period in Red-Sox history [11]. 
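As a sanity check on how such a patch is scored, the $X^2$ value of a win/loss stretch can be recomputed from its game and win counts. The sketch below assumes the null probability of a Yankees win is the overall ratio 0.5427 quoted above; the function name is hypothetical, and small differences from the values in Table 3 (38.76 and 26.99 for the two top patches) come from the rounding of that ratio.

```python
def chi_square_binary(games, wins, p_win):
    """Pearson X^2 of a stretch with `wins` wins out of `games` games."""
    expected_wins = games * p_win
    expected_losses = games * (1.0 - p_win)
    losses = games - wins
    return ((wins - expected_wins) ** 2 / expected_wins
            + (losses - expected_losses) ** 2 / expected_losses)


p_yankees = 0.5427                        # overall Yankees win ratio from the text
print(round(chi_square_binary(204, 155, p_yankees), 2))   # 1924-1933 patch
print(round(chi_square_binary(39, 5, p_yankees), 2))      # 1911-1913 patch
```

Both a heavily winning and a heavily losing stretch score high, since the statistic only measures deviation from the expected ratio.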



7 Such substrings will tend to exhibit large X 2 values and, hence, 
will be captured by our algorithm. 
Start        End          $X^2$ val   Games   Wins   Win %
17-04-1924   06-06-1933   38.76       204     155    75.98%
05-09-1911   01-09-1913   26.99       39      5      12.82%
02-05-1902   27-07-1903   16.93       27      4      14.81%
08-02-1972   28-07-1974   16.56       35      7      20.00%
10-07-1960   07-09-1962   12.05       42      34     80.95%

Table 3: Performance of Yankees against Red-Sox.



Algorithm   $X^2$ val   Start        End          Time
Trivial     38.76       17-04-1924   06-06-1933   0.142s
Our         38.76       17-04-1924   06-06-1933   0.036s
ARLM        38.76       17-04-1924   06-06-1933   0.032s
AGMM        26.99       05-09-1911   01-09-1913   0.011s

Table 4: Comparison with other techniques for sports data.



The comparative results of our algorithm with existing techniques
are summarized in Table 4. As expected, our algorithm and ARLM
find the optimal solution; our algorithm outperforms the trivial
algorithm and is almost as good as ARLM in terms of time (due
to the relatively small string size). Moreover, though AGMM is faster,
it does not find the optimal solution. The best period found by
AGMM was the second best (see Table 3) and has a significantly
lower $X^2$ value.

7.5.2 Analysis of Stock Returns

Most financial models are based on the random walk hypothesis 
which is consistent with the efficient-market hypothesis [6]. They 
assume that the stock market prices evolve according to a random 
walk with a constant drift and, thus, the prices of the stock market 
cannot be predicted. 8 

We analyze the returns of three generic financial securities for
which long historical data are available. The Dow Jones Industrial
Average is one of the oldest stock market indices and shows the
performance of 30 large publicly owned companies in the United
States. Similarly, S&P 500 is another large capitalization-weighted 
index that captures the performance of 500 large-cap common stocks 
actively traded. Finally, the IBM common stock is representative 
of one of the oldest and largest publicly owned firms. We run the 
algorithms on the Dow Jones prices obtained since the year 1928 
onwards (20906 days), S&P 500 since 1950 onwards (15600 days) 
and IBM since 1962 onwards (12517 days). The daily price data 
are obtained from finance.yahoo.com. 

Given the randomness in the stock prices, we assume that the 
prices can increase (or decrease) each day with a fixed probability. 
The fixed probability is calculated as the ratio of days on which 
price went up (or down) to the total number of trading days. We 
find the statistically significant substrings of the binary string encoded
with 1 for a day on which the price of the security went up and 0 otherwise.
These substrings correspond to significantly long periods
that contain a large ratio of days in which the stock price changed. 
The results are summarized in Table 5. 

A lot of bad periods occurred during the Great Depression of 
1930s, the recent dot-com bubble burst and mortgage recession pe- 
riods of the last decade, whereas a number of good periods occurred 
during the economic boom of 1950s and 1960s. These observations 
verify that these statistically significant periods do not occur just 
due to randomness or chance alone, but are consequences of exter- 
nal factors as well. The identification of such significant patterns 
can help in identifying the relevant external factors. Finally, the 
X 2 values of these substrings can also be used in quantifying the 
historical risk of the securities which is one of the most important 
parameters that investment managers like to control. 

8 If the stock prices can be predicted then there is an arbitrage in the 
market which violates the efficient market hypothesis. 



Periods   Security    Start        End           Change
Good      Dow Jones   24-02-1954   06-12-1955    68.10%
          Dow Jones   25-06-1958   [illegible]   43.52%
          S&P 500     15-09-1953   20-09-1955    91.07%
          S&P 500     09-12-1994   1?-05-1995    [illegible]
          IBM         13-08-1970   06-10-1970    37.60%
          IBM         26-10-1962   26-01-1968    252.1?%
Bad       Dow Jones   27-02-1931   04-05-1932    -71.17%
          Dow Jones   19-09-1929   14-11-1929    -41.27%
          S&P 500     26-10-1973   21-11-1974    -39.79%
          S&P 500     05-09-2000   12-03-2003    -46.24%
          IBM         31-03-2005   20-04-2005    -21.20%
          IBM         22-02-1973   13-08-1975    -46.91%

Table 5: Significant periods for the securities.



Algo      Sec.   $X^2$   Start      End        Change    Time
Trivial   Dow    25.22   24-02-54   06-12-55   68.1%     14.10s
Our       Dow    25.22   24-02-54   06-12-55   68.1%     0.89s
ARLM      Dow    25.22   24-02-54   06-12-55   68.1%     4.15s
AGMM      Dow    19.53   24-01-66   09-04-85   325.0%    0.03s
Trivial   S&P    22.21   26-10-73   21-11-74   -39.79%   9.36s
Our       S&P    22.21   26-10-73   21-11-74   -39.79%   0.63s
ARLM      S&P    22.21   26-10-73   21-11-74   -39.79%   2.87s
AGMM      S&P    13.44   22-04-66   09-05-66   -6.44%    0.03s

Table 6: Comparison with other techniques for stock returns.



The comparative performance of our algorithm vis-a-vis the other 
techniques in finding the period with the highest X 2 value is sum- 
marized in Table 6. Again, as expected, our algorithm, trivial algo- 
rithm and ARLM find the same period for which the X 2 value is 
maximized. However, in this case, the time performance advantage 
of our algorithm over ARLM is pretty apparent. AGMM, though 
having the time advantage, does pretty badly in terms of identifying 
the maximum X 2 substring. Especially for S&P 500, it returns a 
substring that is not even close to the top few substrings. 

8. CONCLUSIONS AND FUTURE WORK 

In this paper, we chose to analyze the $X^2$ statistic in the context
of a memoryless Bernoulli model. We experimentally saw that for
a string drawn from such a model, the chi-square value of the most
significant substring increases asymptotically as $2 \ln n$, where $n$ is
the length of the string. However, a rigorous mathematical proof
remains an interesting open problem. Such analysis of asymptotic
behavior has significant applications in deciding the confidence
interval with which the null hypothesis is rejected. Further, the
analysis can be extended to strings generated from Markov
models, the most basic of which is the case when there is a
correlation between adjacent characters.

The single dimensional problem of identification of the most 
significant substring can be extended to two-dimensional grid net- 
works as well as general graphs. One potentially interesting ap- 
plication is in financial time series analysis of two securities that 
might not be very correlated in general, but might point to signif- 
icant correlations during certain specific events such as recession. 
Such correlations are essential to most risk analysis techniques. 
APPENDIX

LEMMA 7. The solution to the recursive relation $T(\lceil l + c\sqrt{l} \rceil) \le T(l) + 1$
with $T(a) = 1$ for $a \le 1$, where $l$ is a positive integer,
is $O(\sqrt{l}/c)$. More specifically, $T(l) \le \frac{4\sqrt{l}}{c} + c^2$.

PROOF. We prove this by induction. The base cases for $l \le c^2$
are trivially satisfied. Further, for any positive integer $l > c^2$,
assume that $T(r) \le \frac{4\sqrt{r}}{c} + c^2$ is true for all positive integers $r$ such
that $c^2 < r < \lceil l + c\sqrt{l} \rceil$. Hence,

$$T(\lceil l + c\sqrt{l} \rceil) \le T(l) + 1 \le \frac{4\sqrt{l}}{c} + c^2 + 1 = \frac{\sqrt{16l + 8c\sqrt{l} + c^2}}{c} + c^2$$
$$< \frac{\sqrt{16l + 9c\sqrt{l}}}{c} + c^2 \quad [\because c^2 < l]$$
$$\le \frac{4\sqrt{\lceil l + c\sqrt{l} \rceil}}{c} + c^2 \qquad (39)$$

□
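The bound of Lemma 7 can be checked numerically by iterating the recurrence directly. The sketch below is only an illustration under stated assumptions: it starts from $l = 1$, treats every step as costing one iteration, and uses a hypothetical function name and arbitrary sample values of $n$ and $c$.

```python
import math


def iterations_to_reach(n, c):
    """Count steps of l <- ceil(l + c * sqrt(l)) needed to go from l = 1 to l >= n."""
    l, steps = 1, 0
    while l < n:
        l = math.ceil(l + c * math.sqrt(l))
        steps += 1
    return steps


for n in (10**4, 10**6):
    for c in (0.5, 1.0, 2.0):
        steps = iterations_to_reach(n, c)
        bound = 4 * math.sqrt(n) / c + c * c      # Lemma 7: 4*sqrt(l)/c + c^2
        print(n, c, steps, round(bound, 1))
```

In each case the number of steps grows like $\sqrt{n}/c$ and stays below the bound stated in the lemma.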



LEMMA 8. In the algorithm for finding the top-t substrings, for
any constant $\epsilon$ and $t < \omega(n)$, $X^2_{max_t} > \ln n$ with probability at
least $1 - \epsilon$.

PROOF. Let $Z_{max_t}$ denote the $t$-th max of $m$ i.i.d. random variables
following the $\chi^2(k)$ distribution. As in the $X^2_{max}$ case in Algorithm 1,
since the asymptotic behavior of the $\chi^2(k)$ distribution is the same for
all $k$, we again prove it only for $k = 2$, which is sufficient. Again,
$f(x)$ and $F(x)$ denote the pdf and cdf of the $\chi^2(2)$ distribution:

$$F(x; 2) = 1 - e^{-x/2} \qquad (40)$$

We have

$$Z_{max_t} = {\max}_t\{Z_1, Z_2, \ldots, Z_m\}, \quad \forall i,\ Z_i \sim \chi^2(k) \qquad (41)$$

Now for each $Z_i$, we define a new Bernoulli random variable $Y_i$
which takes the value 1 if $Z_i > \ln m$ and 0 otherwise:

$$\Pr\{Y_i = 1\} = \Pr\{Z_i > \ln m\} = e^{-\frac{1}{2}\ln m} \qquad (42)$$

Let $Y = \sum_{i=1}^{m} Y_i$; then $Y$ follows the binomial distribution with
probability of success $p = e^{-\frac{1}{2}\ln m}$. Further,

$$\Pr\{Z_{max_t} > \ln m\} = \Pr\{Y \ge t\} \qquad (43)$$

Using Chernoff's inequality for the binomial distribution, for any
constant $\epsilon > 0$,

$$\Pr\{Y \ge t\} \ge 1 - e^{-\frac{(mp - t)^2}{2mp}} \qquad (44)$$

If $t < \omega(\sqrt{m})$, we can effectively ignore $t$ in the above equation.
In that case, the above equation simplifies to

$$\Pr\{Z_{max_t} > \ln m\} \approx 1 - e^{-\frac{mp}{2}} > 1 - \epsilon \qquad (45)$$

Finally, again as in Algorithm 1, at least $O(n)$ substrings are independent.
Therefore, the result holds. □